Analyzing Data with PySpark

"book.txt" is the text file or a digital copy of the book The Hound of the Baskervilles, by Arthur Conan Doyle

book is now an RDD

llist = book.collect()
for line in llist:
    print(line)

Printing the elements of the list returned by the RDD collect() action

Frequency of words before preprocessing, computed with the RDD functions given in the example of this lab

Spark's map() transformation applies a function to each element of the RDD and returns a new RDD of the results

Spark's flatMap() transformation applies a function to every element, flattens the resulting sequences, and returns a new RDD

Common English stop words. The stop-word list was taken from NLTK's list of English stopwords ("https://gist.githubusercontent.com/sebleier/5542/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"), with a few more words added after inspecting the data.

Removing the stop words from the RDD using the filter() RDD function

As seen above, the stop words have been removed; rdd3 no longer contains words such as 'the', 'of', ...

Using an RDD function to group words that start with the same two letters

Using RDD functions to collect the word 'hound' and check its frequency

Making the bag of words (word frequencies) using the MapReduce-style RDD functions

Creating a Spark DataFrame from the word-frequency RDD

Pandas DataFrame of the frequencies of the top 20 words in the corpus, with the stop words removed

Bar chart of the top 20 words in the corpus. From the chart we can observe that 'upon' is the most common word in the corpus, followed by 'sir'.
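The bar chart itself can be produced from the pandas data with matplotlib; a sketch with made-up counts (the notebook plots the real top-20 frequencies):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt

# Illustrative words and counts standing in for the top-20 pandas data.
words = ["upon", "sir", "hound"]
counts = [50, 40, 30]

fig, ax = plt.subplots()
ax.bar(words, counts)
ax.set_xlabel("word")
ax.set_ylabel("frequency")
fig.savefig("top_words.png")
```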

The same can be observed in the scatter plot above: "upon" is the most common word in the corpus.

Common nouns (NN) have the highest count in the corpus, and plural common nouns (NNS) are the second highest
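The tag counting behind this observation can be sketched in pure Python, given (word, tag) pairs as a POS tagger such as NLTK's pos_tag would produce them (the tagged pairs below are hypothetical):

```python
from collections import Counter

# Hypothetical (word, tag) pairs as a POS tagger would emit them.
tagged = [("hound", "NN"), ("moor", "NN"), ("eyes", "NNS"), ("ran", "VBD")]

# Count how many words carry each part-of-speech tag.
tag_counts = Counter(tag for _, tag in tagged)
```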

Sources: